This practical is based on exploratory data analysis and prediction of a dataset derived from a municipal database of healthcare administrative data. This dataset is derived from Vitoria, the capital city of EspĂrito Santo, Brazil (population 1.8 million) and was freely shared under a creative commons license.
Generate an rmarkdown report that contains all the necessary code to document and perform: EDA, prediction of no-shows using XGBoost, and an analysis of variable/feature importance using this data set. Ensure your report includes answers to any questions marked in bold. Please submit your report via brightspace as a link to a git repository containing the rmarkdown and compiled/knitted html version of the notebook.
The Brazilian public health system, known as SUS for Unified Health System in its acronym in Portuguese, is one of the largest health system in the world, representing government investment of more than 9% of GDP. However, its operation is not homogeneous and there are distinct perceptions of quality from citizens in different regions of the country. Non-attendance of medical appointments contributes a significant additional burden on limited medical resources. This analysis will try and investigate possible factors behind non-attendance using an administrative database of appointment data from Vitoria, EspĂrito Santo, Brazil.
The data required is available via the course website.
1 Use the data dictionary describe each of the variables/features in the CSV in your report.
PatientID: A unique identifier for each patient. AppointmentID: A unique identifier to each appointment Gender: The patient Gender (limited to Male or Female) ScheduledDate:The date on which the appointment was scheduled AppointmentDate:The date of the actual appointment Age: Patient age Neighbourhood: District of VitĂłria in which the appointment SocialWelfare: Patient is a recipient of Bolsa FamĂlia welfare payments Hypertension: Patient previously diagnoised with hypertensio (Boolean) Diabetes: Patient previously diagnosed with diabetes (Boolean) AlcoholUseDisorder: Patient previously diagnosed with alcohol use disorder (Boolean) Disability: Patient previously diagnosed with a disability (severity rated 0-4) SMSReceived: At least 1 reminder text sent before appointment (Boolean) NoShow: Patient did not attend scheduled appointment (Boolean: Yes/No)
2 Can you think of 3 hypotheses for why someone may be more likely to miss a medical appointment? 1 Longer wait times between scheduling and appointment date increase no-shows. 2 Patients with certain chronic conditions (e.g., those requiring frequent visits) might have a higher no-show rate due to appointment fatigue or perceived stability of their condition. 3 Socioeconomic factors, such as reliance on social welfare or living in certain neighborhoods, correlate with higher no-show rates. This could be due to transportation difficulties, inability to take time off work, or other systemic barriers.
3 Can you provide 3 examples of important contextual
information that is missing in this data dictionary and dataset that
could impact your analyses e.g., what type of medical appointment does
each AppointmentID refer to?
1 Type of Medical Appointment. Knowing if the appointment is for a
general check-up, a specialist consultation (e.g., cardiology,
pediatrics), a diagnostic test, or a follow-up could be crucial. 2
Previous Appointment History of the Patient 3 Severity or Urgency of the
Medical Condition: Information about why the appointment was scheduled
(e.g., acute pain, chronic condition management, preventive care) is
missing. More urgent conditions might have lower no-show rates.
4 Modify the following to make it reproducible i.e., downloads the data file directly from version control
raw.data <- read_csv(
"https://raw.githubusercontent.com/maguire-lab/health_data_science_research_2025/master/static_files/practicals/lab1_data/2016_05v2_VitoriaAppointmentData.csv"
)
## Rows: 110527 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Gender, Neighbourhood, NoShow
## dbl (9): PatientID, AppointmentID, Age, SocialWelfare, Hypertension, Diabet...
## dttm (2): ScheduledDate, AppointmentDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we need to check data is valid: because we specified col_types and the data parsed without error most of our data seems to at least be formatted as we expect i.e., ages are integers
raw.data %>% filter(Age > 110)
## # A tibble: 5 Ă— 14
## PatientID AppointmentID Gender ScheduledDate AppointmentDate Age
## <dbl> <dbl> <chr> <dttm> <dttm> <dbl>
## 1 3.20e13 5700278 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 2 3.20e13 5700279 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 3 3.20e13 5562812 F 2016-04-08 14:29:17 2016-05-16 00:00:00 115
## 4 3.20e13 5744037 F 2016-05-30 09:44:51 2016-05-30 00:00:00 115
## 5 7.48e14 5717451 F 2016-05-19 07:57:56 2016-06-03 00:00:00 115
## # ℹ 8 more variables: Neighbourhood <chr>, SocialWelfare <dbl>,
## # Hypertension <dbl>, Diabetes <dbl>, AlcoholUseDisorder <dbl>,
## # Disability <dbl>, SMSReceived <dbl>, NoShow <chr>
We can see there are 2 patient’s older than 110 which seems suspicious but we can’t actually say if this is impossible.
5 Are there any individuals with impossible ages? If
so we can drop this row using filter i.e.,
data <- data %>% filter(CRITERIA)
raw.data %>% filter(Age > 110)
## # A tibble: 5 Ă— 14
## PatientID AppointmentID Gender ScheduledDate AppointmentDate Age
## <dbl> <dbl> <chr> <dttm> <dttm> <dbl>
## 1 3.20e13 5700278 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 2 3.20e13 5700279 F 2016-05-16 09:17:44 2016-05-19 00:00:00 115
## 3 3.20e13 5562812 F 2016-04-08 14:29:17 2016-05-16 00:00:00 115
## 4 3.20e13 5744037 F 2016-05-30 09:44:51 2016-05-30 00:00:00 115
## 5 7.48e14 5717451 F 2016-05-19 07:57:56 2016-06-03 00:00:00 115
## # ℹ 8 more variables: Neighbourhood <chr>, SocialWelfare <dbl>,
## # Hypertension <dbl>, Diabetes <dbl>, AlcoholUseDisorder <dbl>,
## # Disability <dbl>, SMSReceived <dbl>, NoShow <chr>
raw.data %>% filter(Age < 0)
## # A tibble: 1 Ă— 14
## PatientID AppointmentID Gender ScheduledDate AppointmentDate Age
## <dbl> <dbl> <chr> <dttm> <dttm> <dbl>
## 1 4.66e14 5775010 F 2016-06-06 08:58:13 2016-06-06 00:00:00 -1
## # ℹ 8 more variables: Neighbourhood <chr>, SocialWelfare <dbl>,
## # Hypertension <dbl>, Diabetes <dbl>, AlcoholUseDisorder <dbl>,
## # Disability <dbl>, SMSReceived <dbl>, NoShow <chr>
# raw.data <- raw.data %>% filter(Age >= 0 & Age <= 110)
First, we should get an idea if the data meets our expectations,
there are newborns in the data (Age==0) and we wouldn’t
expect any of these to be diagnosed with Diabetes, Alcohol Use Disorder,
and Hypertension (although in theory it could be possible). We can
easily check this:
raw.data %>% filter(Age == 0) %>% select(Hypertension, Diabetes, AlcoholUseDisorder) %>% unique()
## # A tibble: 1 Ă— 3
## Hypertension Diabetes AlcoholUseDisorder
## <dbl> <dbl> <dbl>
## 1 0 0 0
We can also explore things like how many different neighborhoods are there and how many appoints are from each?
count(raw.data, Neighbourhood, sort = TRUE)
## # A tibble: 81 Ă— 2
## Neighbourhood n
## <chr> <int>
## 1 JARDIM CAMBURI 7717
## 2 MARIA ORTIZ 5805
## 3 RESISTĂŠNCIA 4431
## 4 JARDIM DA PENHA 3877
## 5 ITARARÉ 3514
## 6 CENTRO 3334
## 7 TABUAZEIRO 3132
## 8 SANTA MARTHA 3131
## 9 JESUS DE NAZARETH 2853
## 10 BONFIM 2773
## # ℹ 71 more rows
6 What is the maximum number of appointments from the same patient?
max_appts_patient <- raw.data %>%
group_by(PatientID) %>%
summarise(NumberOfAppointments = n()) %>%
arrange(desc(NumberOfAppointments)) %>%
head(1)
cat("The maximum number of appointments from the same patient is:", max_appts_patient$NumberOfAppointments, "\n")
## The maximum number of appointments from the same patient is: 88
print(max_appts_patient)
## # A tibble: 1 Ă— 2
## PatientID NumberOfAppointments
## <dbl> <int>
## 1 8.22e14 88
Let’s explore the correlation between variables:
# let's define a plotting function
corplot = function(df){
cor_matrix_raw <- round(cor(df),2)
cor_matrix <- melt(cor_matrix_raw)
#Get triangle of the correlation matrix
#Lower Triangle
get_lower_tri<-function(cor_matrix_raw){
cor_matrix_raw[upper.tri(cor_matrix_raw)] <- NA
return(cor_matrix_raw)
}
# Upper Triangle
get_upper_tri <- function(cor_matrix_raw){
cor_matrix_raw[lower.tri(cor_matrix_raw)]<- NA
return(cor_matrix_raw)
}
upper_tri <- get_upper_tri(cor_matrix_raw)
# Melt the correlation matrix
cor_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap Plot
cor_graph <- ggplot(data = cor_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "darkorchid", high = "orangered", mid = "grey50",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 8, hjust = 1))+
coord_fixed()+ geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank())+
ggtitle("Correlation Heatmap")+
theme(plot.title = element_text(hjust = 0.5))
cor_graph
}
numeric.data = mutate_all(raw.data, function(x) as.numeric(x))
## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `Gender = (function (x) ...`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
# Plot Correlation Heatmap
corplot(numeric.data)
Correlation heatmaps are useful for identifying linear relationships
between variables/features. In this case, we are particularly interested
in relationships between NoShow and any specific
variables.
7 Which parameters most strongly correlate with
missing appointments (NoShow)?
The strongest of these (in terms of magnitude) is ScheduledDate with -0.16, followed by SMSReceived with 0.13.
8 Are there any other variables which strongly correlate with one another?
Age and Hypertension have a strong positive correlation 0.50. Hypertension and Diabetes have a notable positive correlation 0.43. Age and Diabetes have a positive correlation 0.29. ScheduledDate and PatientID have a correlation of 0.16. AppointmentDate and ScheduledDate have a very strong positive correlation 0.61. AppointmentDate and PatientID have a correlation of 0.37.
9 Do you see any issues with PatientID/AppointmentID being included in this plot? 1 Identifiers, not quantitative features. These are unique identifiers for patients and appointments. Their numerical values are arbitrary and don’t represent a meaningful quantity that can be correlated in a statistically useful way with other features or the outcome.
2 Any observed correlations involving these IDs (e.g., PatientID with ScheduledDate or AppointmentDate) are likely coincidental or due to the data generation process (e.g., IDs assigned sequentially over time) rather than indicating a meaningful relationship relevant to predicting no-shows.These should generally be excluded before calculating a feature correlation matrix.
Let’s look at some individual variables and their relationship with
NoShow.
ggplot(raw.data) +
geom_density(aes(x=Age, fill=NoShow), alpha=0.8) +
ggtitle("Density of Age by Attendence")
There does seem to be a difference in the distribution of ages of people
that miss and don’t miss appointments.
However, the shape of this distribution means the actual correlation is
near 0 in the heatmap above. This highlights the need to look at
individual variables.
Let’s take a closer look at age by breaking it into categories.
raw.data <- raw.data %>% mutate(Age.Range=cut_interval(Age, length=10))
ggplot(raw.data) +
geom_bar(aes(x=Age.Range, fill=NoShow)) +
ggtitle("Amount of No Show across Age Ranges")
ggplot(raw.data) +
geom_bar(aes(x=Age.Range, fill=NoShow), position='fill') +
ggtitle("Proportion of No Show across Age Ranges")
10 How could you be misled if you only plotted 1 of these 2 plots of attendance by age group?
The key takeaway from this is that number of individuals > 90 are very few from plot 1 so probably are very small so unlikely to make much of an impact on the overall distributions. However, other patterns do emerge such as 10-20 age group is nearly twice as likely to miss appointments as the 60-70 years old.
Next, we’ll have a look at SMSReceived variable:
ggplot(raw.data) +
geom_bar(aes(x=SMSReceived, fill=NoShow), alpha=0.8) +
ggtitle("Attendance by SMS Received")
ggplot(raw.data) +
geom_bar(aes(x=SMSReceived, fill=NoShow), position='fill', alpha=0.8) +
ggtitle("Proportion Attendance by SMS Received")
11 From this plot does it look like SMS reminders increase or decrease the chance of someone not attending an appointment? Why might the opposite actually be true (hint: think about biases)?
The data, as plotted, shows an association where SMS recipients have a higher no-show rate. However, this is very likely not a causal relationship where SMS causes no-shows. Instead, it’s more probable that SMS messages are being sent to a group of appointments/patients that are already at a higher baseline risk of non-attendance due to other factors. Without a controlled experiment (e.g., randomly assigning SMS reminders to comparable groups), it’s impossible to determine the true effect of the SMS reminders themselves from this observational data.
12 Create a similar plot which compares the the
density of NoShow across the values of disability
#Insert plot
raw.data$Disability <- as.factor(raw.data$Disability)
ggplot(raw.data) +
geom_bar(aes(x=Disability, fill=NoShow), alpha=0.8) +
ggtitle("Attendance by Disability Status") +
theme_minimal()
ggplot(raw.data) +
geom_bar(aes(x=Disability, fill=NoShow), position='fill', alpha=0.8) +
ggtitle("Proportional Attendance by Disability Status") +
labs(y = "Proportion") +
theme_minimal()
Now let’s look at the neighbourhood data as location can correlate highly with many social determinants of health.
ggplot(raw.data) +
geom_bar(aes(x=Neighbourhood, fill=NoShow)) +
theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) +
ggtitle('Attendance by Neighbourhood')
ggplot(raw.data) +
geom_bar(aes(x=Neighbourhood, fill=NoShow), position='fill') +
theme(axis.text.x = element_text(angle=45, hjust=1, size=5)) +
ggtitle('Proportional Attendance by Neighbourhood')
Most neighborhoods have similar proportions of no-show but some have much higher and lower rates.
13 Suggest a reason for differences in attendance rates across neighbourhoods.
Access to Transportation : Some neighborhoods may have better public transportation links to healthcare facilities than others. Lack of reliable or affordable transportation is a common reason for missed appointments.
Now let’s explore the relationship between gender and NoShow.
ggplot(raw.data) +
geom_bar(aes(x=Gender, fill=NoShow))+
ggtitle("Gender by attendance")
ggplot(raw.data) +
geom_bar(aes(x=Gender, fill=NoShow), position='fill')+
ggtitle("Proportion Gender by attendance")
14 Create a similar plot using
SocialWelfare
# Plot the count of attendance by Social Welfare status
ggplot(raw.data) +
geom_bar(aes(x = SocialWelfare, fill = NoShow)) +
ggtitle("Attendance by Social Welfare")
# Plot the proportion of attendance by Social Welfare status
ggplot(raw.data) +
geom_bar(aes(x = SocialWelfare, fill = NoShow), position = 'fill') +
ggtitle("Proportion Attendance by Social Welfare")
Far more exploration could still be done, including dimensionality reduction approaches but although we have found some patterns there is no major/striking patterns on the data as it currently stands.
However, maybe we can generate some new features/variables that more
strongly relate to the NoShow.
Let’s begin by seeing if appointments on any day of the week has more
no-show’s. Fortunately, the lubridate library makes this
quite easy!
raw.data <- raw.data %>% mutate(AppointmentDay = wday(AppointmentDate, label=TRUE, abbr=TRUE),
ScheduledDay = wday(ScheduledDate, label=TRUE, abbr=TRUE))
ggplot(raw.data) +
geom_bar(aes(x=AppointmentDay, fill=NoShow)) +
ggtitle("Amount of No Show across Appointment Day")
ggplot(raw.data) +
geom_bar(aes(x=AppointmentDay, fill=NoShow), position = 'fill') +
ggtitle("Proportion of No Show across Appointment Day")
Let’s begin by creating a variable called
Lag, which is the
difference between when an appointment was scheduled and the actual
appointment.
raw.data <- raw.data %>% mutate(Lag.days=difftime(AppointmentDate, ScheduledDate, units = "days"),
Lag.hours=difftime(AppointmentDate, ScheduledDate, units = "hours"))
ggplot(raw.data) +
geom_density(aes(x=Lag.days, fill=NoShow), alpha=0.7)+
ggtitle("Density of Lag (days) by attendance")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
15 Have a look at the values in lag variable, does anything seem odd?
Let’s see how well we can predict NoShow from the data.
summary(raw.data$Lag.days)
## Length Class Mode
## 110527 difftime numeric
We’ll start by preparing the data, followed by splitting it into testing and training set, modeling and finally, evaluating our results. For now we will subsample but please run on full dataset for final execution.
### REMOVE SUBSAMPLING FOR FINAL MODEL
data.prep <- raw.data %>%
mutate(across(where(is.factor), as.character),
across(where(is.ordered), as.character)) %>%
select(-AppointmentID, -PatientID)
set.seed(42)
data.split <- initial_split(data.prep, prop = 0.7)
train <- training(data.split)
test <- testing(data.split)
Let’s now set the cross validation parameters, and add classProbs so we can use AUC as a metric for xgboost.
fit.control <- trainControl(method="cv",number=3,
classProbs = TRUE, summaryFunction = twoClassSummary)
16 Based on the EDA, how well do you think this is going to work?
Based on the EDA, some predictors (such as SMSReceived, age, lag time, and social welfare status) may be weakly associated with no-shows, but no variables seem to perfectly separate the classes. Therefore, we expect the XGBoost model to achieve moderate performance (e.g., AUC about 0.8), but not to be highly accurate.
Now we can train our XGBoost model
xgb.grid <- expand.grid(eta=c(0.05),
max_depth=c(4),colsample_bytree=1,
subsample=1, nrounds=500, gamma=0, min_child_weight=5)
library(recipes)
# One-hot encode all categorical predictors
rec <- recipe(NoShow ~ ., data = data.prep) %>%
step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# Prep and bake for train/test
prepped <- prep(rec, training = train)
train_ready <- bake(prepped, new_data = train)
test_ready <- bake(prepped, new_data = test)
# Train XGBoost model on one-hot encoded data
xgb.model <- train(NoShow ~ ., data = train_ready, method = "xgbTree",
metric = "ROC", tuneGrid = xgb.grid, trControl = fit.control)
xgb.pred <- predict(xgb.model, newdata = test_ready)
xgb.probs <- predict(xgb.model, newdata = test_ready, type = "prob")
# Ensure both predicted and actual labels are factor with same levels
test_ready$NoShow <- factor(test_ready$NoShow, levels = c("No", "Yes"))
xgb.pred <- factor(xgb.pred, levels = c("No", "Yes"))
confusionMatrix(xgb.pred, test_ready$NoShow, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 26396 6419
## Yes 131 213
##
## Accuracy : 0.8025
## 95% CI : (0.7981, 0.8067)
## No Information Rate : 0.8
## P-Value [Acc > NIR] : 0.1315
##
## Kappa : 0.0422
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.032117
## Specificity : 0.995062
## Pos Pred Value : 0.619186
## Neg Pred Value : 0.804388
## Prevalence : 0.200006
## Detection Rate : 0.006424
## Detection Prevalence : 0.010374
## Balanced Accuracy : 0.513589
##
## 'Positive' Class : Yes
##
This isn’t an unreasonable performance, but let’s look a bit more carefully at the correct and incorrect predictions,
# Ensure NoShow.numerical is available in test_ready
test_ready$NoShow.numerical <- ifelse(test_ready$NoShow == "Yes", 1, 0)
# Make sure xgb.probs has the correct columns for plotting
xgb.probs$Actual <- test_ready$NoShow.numerical
xgb.probs$ActualClass <- test_ready$NoShow
xgb.probs$PredictedClass <- xgb.pred
xgb.probs$Match <- ifelse(xgb.probs$ActualClass == xgb.probs$PredictedClass,
"Correct", "Incorrect")
xgb.probs$Match <- factor(xgb.probs$Match, levels = c("Incorrect", "Correct"))
# Plot (Yes = predicted probability for "Yes" class)
ggplot(xgb.probs, aes(x = Yes, y = Actual, color = Match)) +
geom_jitter(alpha = 0.2, size = 0.25) +
scale_color_manual(values = c("grey40", "orangered")) +
ggtitle("Visualizing Model Performance", "(Dust Plot)")
Finally, let’s close it off with the variable importance of our model:
results = data.frame(Feature = rownames(varImp(xgb.model)$importance)[1:10],
Importance = varImp(xgb.model)$importance[1:10,])
results$Feature = factor(results$Feature,levels=results$Feature)
# [4.10] Plot Variable Importance
ggplot(results, aes(x=Feature, y=Importance,fill=Importance))+
geom_bar(stat="identity")+
scale_fill_gradient(low="grey20",high="orangered")+
ggtitle("XGBoost Variable Importance")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
17 Using the caret package fit and evaluate 1 other ML model on this data.
For computational efficiency, we fitted a logistic regression (GLM) model using the same feature set. Logistic regression trains much faster on high-dimensional data compared to random forest and is a common baseline for classification tasks.
The logistic regression model fits very quickly, but the results show that it does not perform well at predicting no-shows. The overall accuracy is high, but this is mainly because most patients do show up for their appointments. The model struggles to correctly identify patients who will miss their appointment (very low sensitivity), and the dust plot shows many incorrect predictions for no-shows. This highlights how challenging it is to predict no-shows with the current data and simple models.
# Q17 Logistic Regression (Generalized Linear Model)
# Train the logistic regression model
set.seed(42)
glm.model <- train(
NoShow ~ ., # Use all one-hot encoded features
data = train_ready, # One-hot encoded training data
method = "glm", # Generalized Linear Model (logistic regression)
family = "binomial", # Binomial family for logistic regression
metric = "ROC", # Optimize using AUC
trControl = fit.control # Cross-validation settings (as before)
)
# Generate predictions and predicted probabilities on the test set
glm.pred <- predict(glm.model, newdata = test_ready)
glm.probs <- predict(glm.model, newdata = test_ready, type = "prob")
# Ensure labels are factors with the same levels
test_ready$NoShow <- factor(test_ready$NoShow, levels = c("No", "Yes"))
glm.pred <- factor(glm.pred, levels = c("No", "Yes"))
# Confusion matrix
conf_matrix_glm <- confusionMatrix(glm.pred, test_ready$NoShow, positive = "Yes")
print(conf_matrix_glm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 26290 6501
## Yes 237 131
##
## Accuracy : 0.7968
## 95% CI : (0.7924, 0.8011)
## No Information Rate : 0.8
## P-Value [Acc > NIR] : 0.9279
##
## Kappa : 0.0168
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.019753
## Specificity : 0.991066
## Pos Pred Value : 0.355978
## Neg Pred Value : 0.801744
## Prevalence : 0.200006
## Detection Rate : 0.003951
## Detection Prevalence : 0.011098
## Balanced Accuracy : 0.505409
##
## 'Positive' Class : Yes
##
# AUC calculation
test_ready$NoShow.numerical <- ifelse(test_ready$NoShow == "Yes", 1, 0)
library(pROC)
auc_glm <- auc(test_ready$NoShow.numerical, glm.probs[, "Yes"])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
paste("Logistic Regression Area under ROC Curve: ", round(auc_glm, 3), sep = "")
## [1] "Logistic Regression Area under ROC Curve: 0.666"
# create a dust plot
glm.probs$Actual <- test_ready$NoShow.numerical
glm.probs$ActualClass <- test_ready$NoShow
glm.probs$PredictedClass <- glm.pred
glm.probs$Match <- ifelse(glm.probs$ActualClass == glm.probs$PredictedClass,
"Correct", "Incorrect")
glm.probs$Match <- factor(glm.probs$Match, levels = c("Incorrect", "Correct"))
ggplot(glm.probs, aes(x = Yes, y = Actual, color = Match)) +
geom_jitter(alpha = 0.2, size = 0.25) +
scale_color_manual(values = c("grey40", "orangered")) +
ggtitle("Logistic Regression Model Performance", "(Dust Plot)")
18 Based on everything, do you think we can trust analyses based on this dataset? Explain your reasoning.
I think we need to be careful when trusting results from this dataset. Some important information is missing, like the type of appointment or details about patients’ health and backgrounds. There are also some strange data points, like negative lag times and impossible ages, which could affect the analysis. Since the data comes from one city and only covers a certain period, the results might not apply elsewhere. Overall, the dataset is useful for spotting general patterns, but I wouldn’t use it to make strong claims or decisions without more complete and accurate data.
This data is observational and subject to confounding factors not accounted for in the variables provided. For example, the apparent association between SMS reminders and higher no-show rates may reflect targeting of reminders to high-risk individuals, not a true causal relationship.